Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)
Computing systems have become increasingly complex with the emergence of
heterogeneous hardware combining multicore CPUs and GPUs. These parallel
systems exhibit tremendous computational power at the cost of increased
programming effort. This results in a tension between achieving performance and
code portability. Code is either tuned using device-specific optimizations to
achieve maximum performance or is written in a high-level language to achieve
portability at the expense of performance.
We propose a novel approach that offers high-level programming, code
portability and high-performance. It is based on algorithmic pattern
composition coupled with a powerful, yet simple, set of rewrite rules. This
enables systematic transformation and optimization of a high-level program into
a low-level hardware specific representation which leads to high performance
code.
We test our design in practice by describing a subset of the OpenCL
programming model with low-level patterns and by implementing a compiler which
generates high performance OpenCL code. Our experiments show that we can
systematically derive high-performance device-specific implementations from
simple high-level algorithmic expressions. The performance of the generated
OpenCL code is on par with highly tuned implementations for multicore CPUs and
GPUs written by experts. Comment: Technical Report.
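To give a flavour of rewriting pattern compositions, here is a deliberately minimal sketch of one classic rule, map fusion (map f ∘ map g → map (f ∘ g)). The encoding and names are illustrative Python stand-ins, not the paper's actual rule set or its OpenCL backend:

```python
# Hypothetical encoding of algorithmic patterns as nested tuples,
# with one rewrite rule (map fusion). Illustrative only.

def map_pattern(f):
    return ("map", f)

def compose(p, q):
    return ("compose", p, q)

def rewrite_map_fusion(expr):
    """map f . map g  ->  map (f . g): fuses two traversals into one."""
    if (expr[0] == "compose"
            and expr[1][0] == "map" and expr[2][0] == "map"):
        f, g = expr[1][1], expr[2][1]
        return ("map", lambda x: f(g(x)))
    return expr

def evaluate(expr, xs):
    if expr[0] == "map":
        return [expr[1](x) for x in xs]
    if expr[0] == "compose":
        return evaluate(expr[1], evaluate(expr[2], xs))

prog = compose(map_pattern(lambda x: x + 1), map_pattern(lambda x: x * 2))
fused = rewrite_map_fusion(prog)
# The rewrite preserves semantics while removing one traversal:
assert evaluate(prog, [1, 2, 3]) == evaluate(fused, [1, 2, 3]) == [3, 5, 7]
```

Because such rules are semantics-preserving, a compiler can apply them systematically to search for a low-level form that maps well onto a given device.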
Using machine-learning to efficiently explore the architecture/compiler co-design space
Designing new microprocessors is a time consuming task. Architects rely on slow simulators to
evaluate performance and a significant proportion of the design space has to be explored before
an implementation is chosen. This process becomes more time consuming when compiler
optimisations are also considered. Once the architecture is selected, a new compiler must be
developed and tuned. What is needed are techniques that can speed up this whole process and
develop a new optimising compiler automatically.
This thesis proposes the use of machine-learning techniques to address architecture/compiler
co-design. First, two performance models are developed and are used to efficiently search the
design space of a microarchitecture. These models accurately predict performance metrics such
as cycles or energy, or a tradeoff of the two. The first model uses just 32 simulations to model
the entire design space of new applications, an order of magnitude fewer than state-of-the-art
techniques. The second model addresses offline training costs and predicts the average behaviour
of a complete benchmark suite. Compared to the state of the art, it needs five times fewer
training simulations when applied to the SPEC CPU 2000 and MiBench benchmark suites.
Next, the impact of compiler optimisations on the design process is considered. This has
the potential to change the shape of the design space and improve performance significantly. A
new model is proposed that predicts the performance obtainable by an optimising compiler for
any design point, without having to build the compiler. Compared to the state of the art, this
model achieves a significantly lower error rate.
Finally, a new machine-learning optimising compiler is presented that predicts the best
compiler optimisation setting for any new program on any new microarchitecture. It achieves
an average speedup of 1.14x over the default best gcc optimisation level. This represents 61%
of the maximum speedup available, using just one profile run of the application.
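The final predictor can be loosely pictured as a model that maps a program's profile features to an optimisation setting. As a deliberately simplified 1-nearest-neighbour sketch (the feature values and flag strings below are invented, and this is not the thesis's actual model):

```python
import math

# Invented training data: profile feature vector -> best flag setting
# found by earlier search on seen programs. Purely illustrative.
training = [
    ((0.9, 0.1), "-O2 -funroll-loops"),
    ((0.2, 0.8), "-O3"),
]

def predict_flags(features):
    """Return the flag setting of the most similar previously-seen program."""
    _, flags = min(training, key=lambda t: math.dist(t[0], features))
    return flags

print(predict_flags((0.85, 0.2)))  # nearest to the first training point
```

The appeal of this style of model is that predicting a good setting needs only one cheap profile run of the new program, rather than an exhaustive search over flag combinations.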
Generating Fast Sparse Matrix Vector Multiplication From a High Level Generic Functional IR
High-level intermediate representations (IRs) promise the generation of fast code from a high-level description, improving developer productivity while achieving performance traditionally reached only with low-level programming approaches.
High-level IRs come in two flavors: 1) domain-specific IRs designed only for a specific application area; or 2) generic high-level IRs that can be used to generate high-performance code across many domains. Developing generic IRs is more challenging but offers the advantage of reusing a common compiler infrastructure across various applications.
In this paper, we extend a generic high-level IR to enable efficient computation with sparse data structures. Crucially, we encode the sparse representation using reusable dense building blocks already present in the high-level IR. We use a form of dependent types to model sparse matrices in CSR format, explicitly expressing the relationship between multiple dense arrays that separately store the lengths of the rows, the column indices, and the non-zero values of the matrix.
We achieve high performance compared to low-level sparse library code using our extended generic high-level code generator. On an Nvidia GPU, we outperform the highly tuned Nvidia cuSparse implementation of sparse matrix-vector multiplication (SpMV) across 28 sparse matrices of varying sparsity, by 1.7× on average.
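To make the CSR encoding concrete: the matrix is held as three dense arrays, one storing the number of non-zeros in each row, one the column indices, and one the values. A plain-Python sketch of SpMV over this representation (not Lift code; the function and variable names are illustrative):

```python
def spmv_csr(row_len, col_idx, values, x):
    """y = A @ x, with A given as row lengths, column indices, and values."""
    y = []
    k = 0  # running index into col_idx/values
    for n in row_len:
        acc = 0.0
        for _ in range(n):
            acc += values[k] * x[col_idx[k]]
            k += 1
        y.append(acc)
    return y

# 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in this layout:
row_len = [2, 1, 2]
col_idx = [0, 2, 1, 0, 2]
values  = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_len, col_idx, values, [1.0, 1.0, 1.0]))  # -> [3.0, 3.0, 9.0]
```

Each of the three arrays is an ordinary dense array, which is what lets an IR built around dense building blocks express the sparse computation; the dependent typing mentioned above captures how the array lengths relate to one another.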
High Performance Stencil Code Generation with LIFT
Stencil computations are widely used, from physical simulations to machine learning. They are embarrassingly parallel and fit modern hardware such as Graphics Processing Units (GPUs) perfectly. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain-Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers, who have to write almost full-fledged parallelizing compilers and optimizers.
Lift has recently emerged as a promising approach to achieving performance portability; it is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift's key novelty is its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations, and it remains to be seen whether this approach is applicable to other domains.
This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible as compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes.
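The flavour of building a stencil from simple 1D primitives can be sketched with plain-Python stand-ins for a sliding-window primitive and map (the weights and names are illustrative, not actual Lift code):

```python
# slide creates overlapping neighbourhoods; a stencil is then just a
# map of a weighted sum over those neighbourhoods. Illustrative sketch.

def slide(size, step, xs):
    """All windows of `size` elements, taken every `step` elements."""
    return [xs[i:i + size] for i in range(0, len(xs) - size + 1, step)]

def stencil3(xs):
    # map a 3-point weighted sum over all sliding windows
    return [0.25 * a + 0.5 * b + 0.25 * c for a, b, c in slide(3, 1, xs)]

print(stencil3([1.0, 2.0, 4.0, 8.0]))  # -> [2.25, 4.5]
```

Multidimensional stencils arise by composing such 1D primitives across dimensions, and optimizations like tiling correspond to choosing different slide sizes and steps, which is what makes a rewrite-rule treatment possible.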